Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches
Abstract
Bioinformatics applications and pipelines increasingly use k-mer indexes to search for similar sequences. The major problem with k-mer indexes is that they require large amounts of memory. Sampling is often used to reduce index size and query time. Most applications use one of two major types of sampling: fixed sampling and minimizer sampling. It is well known that fixed sampling produces a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling produces faster query times because query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. We systematically compare fixed and minimizer sampling using the human genome as our database. We use the resulting k-mer indexes to find all maximal exact matches between the database and three separate query sets: the mouse genome, the chimp genome, and an NGS data set. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling, at the cost of requiring more space. Second, if we use the same k-mer size for both methods, fixed sampling typically requires half as much space, whereas minimizer sampling processes queries only slightly faster. Third, if we are allowed to use any k-mer size for each method, we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. The reason is that although minimizer sampling is able to sample query k-mers, the number of shared k-mer occurrences that must be processed is much larger for minimizer sampling than for fixed sampling. In conclusion, we argue that for any application where each shared k-mer occurrence must be processed, fixed sampling is the right sampling method.
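To make the two sampling schemes concrete, the following is a minimal Python sketch and not the implementation used in the paper. It assumes that fixed sampling keeps every w-th k-mer start position of the database, and that minimizer sampling keeps, for each window of w consecutive k-mers, the lexicographically smallest k-mer (leftmost on ties). The function names, the parameter w, the lexicographic ordering, and the toy sequence are illustrative assumptions.

```python
def fixed_sample(seq, k, w):
    """Start positions kept by fixed sampling: every w-th k-mer (assumed scheme)."""
    n_kmers = len(seq) - k + 1
    return list(range(0, n_kmers, w))


def minimizer_sample(seq, k, w):
    """Start positions kept by minimizer sampling (assumed scheme).

    For each window of w consecutive k-mers, keep the lexicographically
    smallest k-mer; ties are broken by taking the leftmost position.
    """
    n_kmers = len(seq) - k + 1
    picked = set()
    for start in range(0, n_kmers - w + 1):
        window = range(start, start + w)
        # Compare by (k-mer string, position) so ties resolve to the leftmost.
        best = min(window, key=lambda i: (seq[i:i + k], i))
        picked.add(best)
    return sorted(picked)


if __name__ == "__main__":
    toy = "ACGTACGTAC"  # hypothetical toy sequence, not real data
    print("fixed:    ", fixed_sample(toy, k=3, w=3))
    print("minimizer:", minimizer_sample(toy, k=3, w=3))
```

On this toy input, fixed sampling keeps 3 positions and minimizer sampling keeps 4, which illustrates in miniature why a minimizer-sampled index tends to be larger for the same k and w. Because the minimizer choice depends only on the window contents, the same rule can be applied to query sequences, which is the property the abstract refers to when it says query k-mers can also be sampled.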
Similar resources
The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome
One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a k-mer index such as BLAST. A big problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some k-mer occurrences are stored. Most...
PONTIFICIA UNIVERSIDAD CATOLICA DE CHILE FACULTAD DE MATEMATICAS NONPARAMETRIC BAYESIAN ANALYSIS FOR ASSESSING HOMOGENEITY IN k × l CONTINGENCY TABLES WITH FIXED RIGHT MARGIN TOTALS
In this work we postulate a nonparametric Bayesian model for data that can be accommodated in a contingency table with fixed right margin totals. This data structure usually arises when comparing different groups regarding classification probabilities for a number of categories. We assume cell count vectors for each group to be conditionally independent, and with multinomial distribution given ...
Comparing the asymptotic power of exact tests in 2×2 tables
A 2×2 table may arise from three types of sampling, depending on the number of previously fixed marginals, and may yield three possible, differing, probabilistic models. From the unconditional point of view each model requires a specific solution but, within each model, the calculation time increases as the test procedure chosen is more powerful, and, between the models, the calculation time de...
Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes
Exact string matching is a problem that computer programmers face on a regular basis, and full-text indexes like the suffix tree or the suffix array provide fast string search over large texts. In the last decade, research on compressed indexes has flourished because the main problem in large-scale applications is the space consumption of the index. Nowadays, the most successful compressed inde...
Exact and asymptotic inference in clinical trials with small event rates under inverse sampling.
In this paper, we discuss statistical inference for a 2 × 2 table under inverse sampling, where the total number of cases is fixed by design. We demonstrate that the exact unconditional distributions of some relevant statistics differ from the distributions under conventional sampling, where the sample size is fixed by design. This permits us to define a simple unconditional alternative to Fish...